Adding MultiModal ShardedDataloader #262

Open · wants to merge 1 commit into main

Conversation

@ashvinnihalani commented Nov 19, 2024

Description

Adds a MultiModal ShardedDataloader and several of the samplers to be used with it.

Additional context

ShardedDataloader is a flexible multimodal dataloader that targets the large-scale LLM/VLM training use case. A single backend instance of the S3 dataloader is not sufficient on its own; it will face the following issues, which this dataloader attempts to address (a minimal sketch of the sharding idea follows the list below):

  • Pre-download times: Other dataloader implementations often require you to pre-download part or all of the dataset. Even dataset implementations like IndexedDataset require you to generate index files, either ahead of time or at runtime, which is expensive.
  • Fast access: Mmap-backed datasets are often infeasible when a customer wants to combine many separate datasets. Without mmap, access times are slow; with mmap, you can hit the system's open-file limit.
  • Flexibility: Dataloaders sometimes trade flexibility for performance by pre-tokenizing/pre-embedding multimodal components. This doesn't scale to complex multimodal types like video, and it often requires individuals to reprocess their data every time a data-prep step changes.

  • I have updated the CHANGELOG or README if appropriate

Related items

Testing


By submitting this pull request, I confirm that my contribution is made under the terms of the BSD 3-Clause License and I agree to the terms of the LICENSE.

Comment on lines +325 to +326
sub_shards.extend(shards[i])
if len(sub_shards) == num_uri_merge:
Contributor:

If we are extending sub_shards with a list, isn't it possible that the size of sub_shards exceeds num_uri_merge? Then the condition will be false, and we will continue to expand sub_shards.

class S3ShardSampler(ShardSampler, pl.core.hooks.CheckpointHooks):
    def __init__(self, uri: str, glob: Optional[str] = None, recursive: bool = True, num_uri_merge: int = 0):
        s3_client = self._get_client()
        self.shards: List[str] = list(get_objects_from_uris(uri, s3_client)  # type: ignore
Contributor:

Isn't it missing a closing parenthesis?

Contributor:

Actually, what is the idea behind that line? uri represents a single object, so get_objects_from_uris will return one S3BucketKeyData, i.e. the passed uri parsed into bucket and key. As I understand it, self.shards is not expected to hold S3BucketKeyData.
